Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle ShortDecimal correctly inside importFromArrow #8957

Closed
wants to merge 8 commits into from

Conversation

boneanxs
Copy link
Contributor

@boneanxs boneanxs commented Mar 5, 2024

Arrow uses int128_t to store ShortDecimal values, while inside velox we use int64_t.

ExportToArrow already handle it specifically, but ImportFromArrow misses this.

This pr tries to fix it.

@facebook-github-bot
Copy link
Contributor

Hi @boneanxs!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

Copy link

netlify bot commented Mar 5, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 02a6ab6
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65fb92648e11560008d1ba6b

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 5, 2024
@boneanxs
Copy link
Contributor Author

boneanxs commented Mar 6, 2024

Hey @PHILO-HE @rui-mo @pedroerp, pls review this, thanks!

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. Just added several comments.

if (type->isShortDecimal()) {
values = AlignedBuffer::allocate<int64_t>(arrowArray.length, pool);
auto rawValues = values->asMutable<int64_t>();
auto oldValues = static_cast<const int128_t*>(arrowArray.buffers[1]);
Copy link
Collaborator

@rui-mo rui-mo Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like arrow array can hold a decimal of a bit-width other than int128_t (int256 is also possible). Maybe we can add a check in Velox importFromArrowImpl to ensure the bit-width is not set or the bit-width is indeed 128. Then it should be safe to cast buffers[1] as int128* here.
https://github.com/apache/arrow/blob/main/cpp/src/arrow/c/bridge.cc#L348-L356

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since type is firstly called here:

auto type = importFromArrow(arrowSchema);
, whereas the decimal type must be int128_t, otherwise importFromArrowImpl could already fail, do we need to add extra check here?

Copy link
Collaborator

@rui-mo rui-mo Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried below cases in the unit test TEST_F(ArrowBridgeSchemaImportTest, scalar), and they both work. Could you help confirm?

EXPECT_EQ(*DECIMAL(10, 4), *testSchemaImport("d:10,4,128"));
EXPECT_EQ(*DECIMAL(10, 4), *testSchemaImport("d:10,4,256"));

velox/vector/arrow/Bridge.cpp Outdated Show resolved Hide resolved
@@ -1102,6 +1102,21 @@ class ArrowBridgeArrayImportTest : public ArrowBridgeArrayExportTest {
}
}

void testShortDecimalImport() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we reuse the method testArrowImport? Maybe a template parameter TOuput and a default argument for the output values need to be added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, @rui-mo I tried to reuse the method but the codes looks lengthy and not elegant... Could you pls give me more hint in case I miss something We have to pass int128_t as arrowInput while int64_t as vector raw values, and take these values as Decimal type...😣

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented some code in rui-mo@a2d33ff. Please check if it makes sense, thanks.

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

velox/vector/arrow/Bridge.cpp Outdated Show resolved Hide resolved
int bitWidth = std::stoi(&format[idx + sz + 1], &sz);
if (bitWidth != 128) {
VELOX_USER_FAIL(
"Conversion failed for '{}'. Velox decimal does not support custom bitwidth.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VELOX_USER_CHECK_EQ(bitWidth, 128, "msg");

@boneanxs boneanxs requested a review from rui-mo March 8, 2024 02:31
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Just two nits.

velox/vector/arrow/Bridge.cpp Outdated Show resolved Hide resolved
@boneanxs boneanxs requested a review from rui-mo March 8, 2024 07:03
Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Let's see if the @mbasmanova @Yuhta have any suggestion.

@boneanxs
Copy link
Contributor Author

Gentle ping @Yuhta @majetideepak @pedroerp , hey, could you pls help review this?

@@ -972,7 +972,17 @@ TypePtr importFromArrowImpl(
// Parse "d:".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for readability, maybe also move the logic to a helper funciton?

case 'd':
  return parseDecimalFormat(format);

Copy link
Contributor

@pedroerp pedroerp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@facebook-github-bot
Copy link
Contributor

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pedroerp
Copy link
Contributor

Like I pointed out above, the test still access out of bounds memory in the string. We got this with ASAN runs on the new tests:

[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ArrowBridgeSchemaImportTest
[ RUN      ] ArrowBridgeSchemaImportTest.scalar

=================================================================
==48809==ERROR: AddressSanitizer: global-buffer-overflow on address 0x000000307f05 at pc 0x000000ab95f9 bp 0x7ffca1dba5a0 sp 0x7ffca1db9d60
READ of size 1 at 0x000000307f05 thread T0
SCARINESS: 12 (1-byte-read-global-buffer-overflow)
    #0 0xab95f8 in __interceptor_strlen.part.0 ubsan.c
    #1 0x7f407300dafd in std::char_traits<char>::length(char const*) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/char_traits.h:371
    #2 0x7f407300a708 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>::basic_string<std::allocator<char>>(char const*, std::allocator<char> const&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/basic_string.h:536
    #3 0x7f407340994b in facebook::velox::(anonymous namespace)::parseDecimalFormat(char const*) fbcode/velox/vector/arrow/Bridge.cpp:939
    #4 0x7f4073406e4b in facebook::velox::(anonymous namespace)::importFromArrowImpl(char const*, ArrowSchema const&) fbcode/velox/vector/arrow/Bridge.cpp:1005
    #5 0x7f4073405efc in facebook::velox::importFromArrow(ArrowSchema const&) fbcode/velox/vector/arrow/Bridge.cpp:1224
    #6 0xa2fcab in facebook::velox::test::(anonymous namespace)::ArrowBridgeSchemaImportTest::testSchemaImport(char const*) fbcode/velox/vector/arrow/tests/ArrowBridgeSchemaTest.cpp:259
    #7 0xa2e85a in facebook::velox::test::(anonymous namespace)::ArrowBridgeSchemaImportTest_scalar_Test::TestBody() fbcode/velox/vector/arrow/tests/ArrowBridgeSchemaTest.cpp:381
    #8 0x7f4080a5689e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) fbsource/src/gtest.cc:2675
    #9 0x7f4080a56124 in testing::Test::Run() fbsource/src/gtest.cc:2692

auto firstCommaIdx = formatStr.find(',', 2);
auto secondCommaIdx = formatStr.find(',', firstCommaIdx + 1);

if (firstCommaIdx == std::string::npos ||
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add extra check here to avoid ASAN issue, cc @pedroerp

@facebook-github-bot
Copy link
Contributor

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@pedroerp merged this pull request in dc561a3.

Copy link

Conbench analyzed the 1 benchmark run on commit dc561a35.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
…ator#8957)

Summary:
Arrow uses `int128_t` to store ShortDecimal values, while inside velox we use `int64_t`.

`ExportToArrow` already handle it specifically, but `ImportFromArrow` misses this.

This pr tries to fix it.

Pull Request resolved: facebookincubator#8957

Reviewed By: Yuhta

Differential Revision: D55019687

Pulled By: pedroerp

fbshipit-source-id: 2fe32236a0e17a52ef713cff96836a48a37fec56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants